# Multimodal Understanding
Nvidia.cosmos Reason1 7B GGUF
Cosmos-Reason1-7B is a 7B-parameter foundational model released by NVIDIA, specializing in image-to-text tasks.
Large Language Model
DevQuasar · 287 downloads · 1 like
Devstral Small Vision 2505 GGUF
Apache-2.0
A vision encoder based on the Mistral Small model, supporting image-text generation tasks and compatible with the llama.cpp framework.
Image-to-Text
ngxson · 777 downloads · 20 likes
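Since this build targets llama.cpp, it can also be driven from Python through the llama-cpp-python bindings. The sketch below is illustrative only: the repo id and GGUF filename pattern are assumptions, and image input additionally requires the model's mmproj file plus a matching multimodal chat handler.

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python huggingface_hub).
# Repo id and filename pattern are assumptions; check the model card for actual names.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="ngxson/Devstral-Small-Vision-2505-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",                           # assumed quant filename pattern
    n_ctx=8192,
)

# Text-only chat shown here; image input would additionally need the mmproj GGUF
# and a multimodal chat handler matching this model family.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a vision encoder does."}]
)
print(out["choices"][0]["message"]["content"])
```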
Magma 8B GGUF
MIT
Magma-8B is an image-text-to-text model distributed in GGUF format, suitable for multimodal task processing.
Image-to-Text
Mungert · 545 downloads · 1 like
Typhoon Ocr 7b
A vision-language model designed specifically for Thai-English real-world document parsing, built on Qwen2.5-VL-Instruct.
Image-to-Text
Transformers · Supports Multiple Languages
scb10x · 126 downloads · 9 likes
Qwen Qwen2.5 VL 72B Instruct GGUF
Other
A quantized version of the Qwen2.5-VL-72B-Instruct multimodal large language model, supporting image-text-to-text tasks and available in quantization levels ranging from high precision to low memory footprint.
Image-to-Text · English
bartowski · 1,336 downloads · 1 like
Qwen Qwen2.5 VL 7B Instruct GGUF
Apache-2.0
A quantized version of Qwen2.5-VL-7B-Instruct produced with llama.cpp, supporting multimodal tasks such as image-to-text conversion.
Image-to-Text · English
bartowski · 2,056 downloads · 2 likes
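The GGUF files here are consumed by llama.cpp; for comparison, the underlying Qwen2.5-VL-7B-Instruct checkpoint can be run with Hugging Face Transformers roughly as sketched below (class names assume a recent Transformers release with Qwen2.5-VL support; the image path is a placeholder).

```python
# Sketch: running the unquantized Qwen2.5-VL-7B-Instruct with Transformers
# (the GGUF variants above are instead consumed by llama.cpp).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```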
Vilt Finetuned 100
Apache-2.0
A vision-language model based on ViLT-B32-MLM, fine-tuned on VQA datasets.
Image-to-Text
Transformers
bangbrecho · 15 downloads · 0 likes
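As a ViLT-based VQA fine-tune, it presumably follows the standard ViLT question-answering interface in Transformers; a minimal sketch, assuming the checkpoint stores its VQA answer labels in the config (repo id and image path are placeholders):

```python
# Sketch: visual question answering with a ViLT checkpoint via Transformers.
from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image

model_id = "bangbrecho/vilt_finetuned_100"  # assumed repo id
processor = ViltProcessor.from_pretrained(model_id)
model = ViltForQuestionAnswering.from_pretrained(model_id)

image = Image.open("example.jpg")           # placeholder image
question = "How many people are in the picture?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
# Assumes the fine-tune keeps an id2label mapping for its VQA answer vocabulary.
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)
```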
TEMPURA Qwen2.5 VL 3B S1
TEMPURA is a video temporal understanding framework combining causal reasoning with fine-grained temporal segmentation, enhancing video event comprehension through two-stage training
Video-to-Text
Transformers
andaba · 16 downloads · 0 likes
Qwen2.5 Vl 7b Cam Motion Preview
Other
A camera motion analysis model fine-tuned from Qwen2.5-VL-7B-Instruct, focused on camera motion classification in videos and on video-text retrieval tasks.
Video-to-Text
Transformers
chancharikm · 1,456 downloads · 10 likes
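Because this is described as a Qwen2.5-VL-7B-Instruct fine-tune, video inference presumably mirrors the base model's Transformers interface, with a clip passed as a list of sampled frames; a rough sketch under that assumption (frame sampling, prompt format, and video-related kwargs may differ in practice):

```python
# Sketch: video-to-text with a Qwen2.5-VL-based checkpoint, assuming it keeps
# the base model's Transformers interface. Frame extraction is simplified.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

model_id = "chancharikm/qwen2.5-vl-7b-cam-motion-preview"  # assumed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A "video" here is a list of pre-extracted frames (placeholder paths).
frames = [Image.open(f"frame_{i:03d}.jpg") for i in range(8)]

messages = [{
    "role": "user",
    "content": [
        {"type": "video"},
        {"type": "text", "text": "Describe the camera motion in this clip."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], videos=[frames], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```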
Gemma 3 12b It Qat Int4 GGUF
Gemma 3 is Google's lightweight open model series based on Gemini technology. The 12B version employs Quantization-Aware Training (QAT) technology, supports multimodal input, and features a 128K context window.
Image-to-Text
unsloth · 1,921 downloads · 3 likes
Gemma 3 27b It Qat GGUF
Gemma 3 is a lightweight open model series built by Google based on Gemini technology, supporting multimodal input and text output, featuring a 128K large context window and support for 140+ languages.
Image-to-Text · English
unsloth · 2,683 downloads · 3 likes
Gemma 3 12b It Qat Int4
Gemma 3 is a lightweight open model series from Google, built on the research and technology used to create Gemini models. The 12B version is an instruction-tuned multimodal model supporting text and image inputs to generate text outputs.
Image-to-Text
Transformers
unsloth · 78 downloads · 1 like
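Transformers-compatible Gemma 3 instruction-tuned checkpoints generally follow the same image-text-to-text flow; a minimal sketch shown against the base google/gemma-3-12b-it id, on the assumption that this QAT Int4 variant exposes the same interface (the image path is a placeholder):

```python
# Sketch: image+text -> text with a Gemma 3 instruction-tuned checkpoint.
# Shown with the base model id; the QAT Int4 repo is assumed to behave the same.
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image

model_id = "google/gemma-3-12b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("example.jpg")},  # placeholder image
        {"type": "text", "text": "What is shown in this picture?"},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```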
Blip Gqa Ft
MIT
A fine-tuned vision-language model based on Salesforce/blip2-opt-2.7b for visual question answering tasks
Image-to-Text
Transformers
phucd · 29 downloads · 0 likes
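Given the stated Salesforce/blip2-opt-2.7b base, visual question answering presumably uses the standard BLIP-2 classes in Transformers; a minimal sketch (repo id, prompt, and image path are placeholders):

```python
# Sketch: visual question answering with a BLIP-2 (OPT-2.7B) based checkpoint.
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image

model_id = "phucd/blip_gqa_ft"  # assumed repo id; base is Salesforce/blip2-opt-2.7b
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
prompt = "Question: what color is the car? Answer:"  # BLIP-2 OPT question format

inputs = processor(images=image, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```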
Blip Custom Captioning
BSD-3-Clause
BLIP is a unified vision-language pretraining framework, excelling in vision-language tasks such as image caption generation
Image-to-Text
hiteshsatwani · 78 downloads · 0 likes
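For caption generation, the usual BLIP interface in Transformers looks roughly like this; a sketch that assumes the checkpoint keeps the standard BLIP captioning head (repo id and image path are placeholders):

```python
# Sketch: image captioning with a BLIP checkpoint via Transformers.
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

model_id = "hiteshsatwani/blip-custom-captioning"  # assumed repo id
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
inputs = processor(images=image, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```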
Internvl3 8B 6bit
Other
InternVL3-8B-6bit is a vision-language model converted to MLX format, supporting multilingual image-text-to-text tasks.
Image-to-Text
Transformers · Other
mlx-community · 70 downloads · 1 like
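MLX conversions like this one are typically run on Apple silicon with the mlx-vlm package; a rough sketch following that package's documented load/generate helpers (function signatures vary between mlx-vlm versions, and the image path is a placeholder):

```python
# Sketch: running an MLX-converted vision-language model with mlx-vlm
# (Apple silicon only; pip install mlx-vlm). Helper signatures may vary by version.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/InternVL3-8B-6bit"  # assumed repo id
model, processor = load(model_path)
config = load_config(model_path)

image = ["example.jpg"]  # placeholder image path
prompt = "Describe this image."

formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(image))
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```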
Gemma 3 12B It Qat GGUF
Gemma 3 12B IT is a large language model developed by Google, supporting multimodal input and long-context processing.
Image-to-Text
lmstudio-community · 36.65k downloads · 4 likes
Gemma 3 4B It Qat GGUF
The Gemma 3 4B IT model by Google supports multimodal input and long-context processing, suitable for text generation and image understanding tasks.
Image-to-Text
lmstudio-community · 46.55k downloads · 10 likes
Gemma 3 27b It Qat 3bit
Other
This model is a 3-bit quantized version converted from google/gemma-3-27b-it-qat-q4_0-unquantized to the MLX format, suitable for image-to-text tasks.
Image-to-Text
Transformers · Other
mlx-community · 197 downloads · 2 likes
Gemma 3 27b It Qat 4bit
Other
Gemma 3 27B IT QAT 4bit is an MLX-format model converted from Google's original model, supporting image-to-text tasks.
Image-to-Text
Transformers · Other
mlx-community · 2,200 downloads · 12 likes
Gemma 3 4b It GPTQ 4b 128g
An INT4 (GPTQ, group size 128) quantized version of the gemma-3-4b-it model, significantly reducing storage and computational resource requirements.
Image-to-Text
Transformers
ISTA-DASLab · 502 downloads · 2 likes
Gemma 3 1b It Qat Q4 0 Unquantized
Gemma 3 is a lightweight open-source multimodal model series developed by Google, built on Gemini technology, supporting text and image inputs with text outputs. The 1B version has undergone instruction tuning and quantization-aware training (QAT), making it suitable for deployment in resource-constrained environments.
Image-to-Text
Transformers
google · 246 downloads · 4 likes
Gemma 3 12b It Qat Q4 0 Unquantized
Gemma 3 is Google's lightweight open-source multimodal model series based on Gemini technology, supporting text and image inputs with text outputs. The 12B version undergoes instruction tuning and quantization-aware training (QAT), making it suitable for deployment in resource-limited environments.
Image-to-Text
Transformers
google · 1,159 downloads · 10 likes
Llama 4 Scout 17B 16E Linearized Bnb Nf4 Bf16
Other
Llama 4 Scout is a Mixture of Experts (MoE) model released by Meta with 17B active parameters and 16 experts, supporting multilingual text and image understanding; this build linearizes the expert modules for PEFT/LoRA compatibility.
Multimodal Fusion
Transformers · Supports Multiple Languages
axolotl-quants · 6,861 downloads · 3 likes
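The linearized expert layout is meant to make the MoE blocks visible to PEFT, so LoRA fine-tuning presumably follows the usual peft recipe on top of the NF4 weights; a generic sketch in which the repo id, auto class, and target module names are illustrative assumptions rather than the repository's documented recipe:

```python
# Sketch: attaching a LoRA adapter to an NF4-quantized MoE checkpoint with peft.
# The repo id, auto class, and target module names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "axolotl-quants/Llama-4-Scout-17B-16E-Linearized-bnb-nf4-bf16"  # assumed id

# The checkpoint already ships bitsandbytes NF4 weights, so the stored
# quantization config should be picked up automatically on load.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = prepare_model_for_kbit_training(model)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```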
Gemma 3 4b It Qat Q4 0 GGUF
Gemma is a family of lightweight, cutting-edge open models introduced by Google, built on the same research and technology as the Gemini models. Supports text and image inputs and generates text outputs.
Image-to-Text
Mungert · 713 downloads · 2 likes
Gemma 3 27b It Qat Autoawq
Gemma 3 is a lightweight, cutting-edge open model series from Google, built on the same technology as Gemini, supporting multimodal input (text/image) and text output. The 27B version significantly reduces memory requirements through quantization-aware training.
Image-to-Text
Safetensors
gaunernst · 789 downloads · 4 likes
Gemma 3 12b It Qat Autoawq
Gemma 3 is Google's lightweight open model series based on Gemini technology, supporting multimodal input and text output.
Image-to-Text
Safetensors
gaunernst · 498 downloads · 3 likes
Gemma 3 27b It Qat Q4 0 Gguf
Gemma 3 is a lightweight open-source multimodal model series by Google, supporting text and image inputs with text generation capabilities. This version is a 27B parameter instruction-tuned model using quantization-aware training, offering lower memory requirements while maintaining near-original quality.
Image-to-Text
vinimuchulski · 4,674 downloads · 6 likes
Gemma 3 12b It Qat Q4 0 Gguf
Gemma 3 is a lightweight open model built by Google based on Gemini technology, supporting text and image inputs to generate text outputs. The 12B version is instruction-tuned and suitable for various generation and comprehension tasks.
Image-to-Text
vinimuchulski · 1,860 downloads · 4 likes
Llama 4 Maverick 17B 128E Instruct
Other
Llama 4 Maverick is a multimodal Mixture of Experts (MoE) model developed by Meta with 17B active parameters and 128 expert modules, supporting multilingual text and image understanding.
Large Language Model
Transformers · Supports Multiple Languages
meta-llama · 87.79k downloads · 309 likes
Qwen2.5 VL 7B Instruct Q8 0 GGUF
Apache-2.0
This model is a GGUF-format conversion of Qwen2.5-VL-7B-Instruct, supporting multimodal tasks and suited to interactive image-and-text processing.
Image-to-Text · English
cxtb · 72 downloads · 1 like
Gemma 3 27b It Int4 Gguf
Gemma 3 is a lightweight, cutting-edge open model family from Google, built on the same research and technology as the Gemini models. It supports text/image input and text output, and is offered in both pretrained and instruction-tuned weight versions.
Image-to-Text
gaunernst · 232 downloads · 3 likes
Gemma 3 12b It Int4 Gguf
Gemma 3 is a lightweight multimodal open model from Google that supports text and image inputs with text outputs, featuring a 128K large context window and support for 140+ languages.
Image-to-Text
gaunernst · 107 downloads · 1 like
Sapnous VR 6B
Apache-2.0
Sapnous-6B is an advanced vision-language model that enhances perception and understanding of the world through powerful multimodal capabilities.
Image-to-Text
Transformers · English
Sapnous-AI · 261 downloads · 5 likes
Openvlthinker 7B
Apache-2.0
OpenVLThinker-7B is a vision-language reasoning model specifically designed for multimodal tasks, with particular optimization for solving visual mathematical problems.
Image-to-Text
Transformers
ydeng9 · 594 downloads · 16 likes
Gemma 3 12b It Int4 Awq
Gemma is Google's lightweight cutting-edge open-source model family, built using the same research technology as Gemini models. Gemma 3 is a multimodal model supporting text/image input and text output.
Image-to-Text
Transformers
gaunernst · 4,658 downloads · 9 likes
Timezero ActivityNet 7B
TimeZero is a reasoning-guided large-scale vision-language model (LVLM) specifically designed for temporal video grounding (TVG) tasks, achieving dynamic video-language relationship analysis through reinforcement learning methods.
Video-to-Text
Transformers
wwwyyy · 142 downloads · 1 like
Timezero Charades 7B
TimeZero is a reasoning-guided large vision-language model (LVLM) specifically designed for temporal video grounding (TVG) tasks. It identifies temporal segments in videos corresponding to natural language queries through reinforcement learning methods.
Video-to-Text
Transformers
wwwyyy · 183 downloads · 0 likes
Gemma 3 12b Pt Unsloth Bnb 4bit
Gemma 3 is a lightweight, advanced open model series launched by Google, built on the same research technology as Gemini, supporting multimodal input and text output.
Image-to-Text
Transformers · English
unsloth · 1,286 downloads · 1 like
Gemma 3 4b Pt Qat Q4 0 Gguf
Gemma 3 is a lightweight open model series launched by Google, built on the same technology as Gemini, supporting multimodal input and text output.
Image-to-Text
google · 912 downloads · 16 likes
Gemma 3 27b It Mlx
This is an MLX-converted version of the Google Gemma 3 27B IT model, supporting image-text-to-text tasks.
Image-to-Text
Transformers
stephenwalker · 24 downloads · 1 like